Background
Microbiome sample clustering can be performed using either model-based statistical methods or machine learning methods.
- Machine learning methods, which rely on defined distance metrics, are used more frequently than model-based statistical methods “due to their efficient implementation and easy interpretation.”
- I used the partitioning around medoids (PAM) clustering method, which is related to K-means but considered more robust. K-means minimizes the sum of squared distances to the cluster centers, which makes it sensitive to outliers; PAM instead minimizes the sum of distances to the medoids, which are actual samples.
Note: clustering was performed directly on the distance matrices, not on ordinations or ordination scores.
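To make the outlier argument concrete, here is a minimal sketch in plain Python (not the R workflow used for the analysis). The five values are hypothetical, and the helper names `sum_abs_dist` and `medoid` are made up for illustration. It compares a K-means-style center (the mean) with a PAM-style center (the medoid) for a single one-dimensional cluster containing one outlier.

```python
from statistics import mean

def sum_abs_dist(points, center):
    """Sum of absolute distances from every point to a candidate center."""
    return sum(abs(p - center) for p in points)

def medoid(points):
    """Return the observed point that minimizes the sum of distances to all points."""
    return min(points, key=lambda candidate: sum_abs_dist(points, candidate))

# Hypothetical one-dimensional cluster with a single outlier (100.0)
values = [1.0, 2.0, 3.0, 4.0, 100.0]

centroid = mean(values)   # K-means-style center, pulled toward the outlier
center = medoid(values)   # PAM-style center, always one of the observed points

print("centroid:", centroid)
print("medoid:", center)
```

The mean is dragged far from the bulk of the data, while the medoid stays on one of the typical observations.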
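As an illustration of clustering directly on a dissimilarity matrix, the sketch below computes Sørensen dissimilarities from presence/absence sets and then searches for the medoids that minimize the PAM objective. It is plain Python rather than the R code used here, the six samples and OTU names are hypothetical, and the exhaustive search stands in for PAM's BUILD/SWAP heuristic: it targets the same objective but is only feasible for tiny data sets.

```python
from itertools import combinations

def sorensen(a, b):
    """Sørensen dissimilarity between two non-empty presence/absence sets."""
    shared = len(a & b)
    return 1.0 - 2.0 * shared / (len(a) + len(b))

# Hypothetical samples: each is the set of taxa detected in it
samples = {
    "s1": {"otu1", "otu2", "otu3"},
    "s2": {"otu1", "otu2", "otu4"},
    "s3": {"otu1", "otu2", "otu3", "otu4"},
    "s4": {"otu7", "otu8", "otu9"},
    "s5": {"otu7", "otu8"},
    "s6": {"otu8", "otu9", "otu10"},
}
names = sorted(samples)

# Full pairwise dissimilarity matrix; the clustering only ever sees this
dist = {(i, j): sorensen(samples[i], samples[j]) for i in names for j in names}

def k_medoids_exhaustive(names, dist, k):
    """Try every set of k medoids and keep the one with the lowest total distance."""
    def cost(medoids):
        return sum(min(dist[(n, m)] for m in medoids) for n in names)
    best = min(combinations(names, k), key=cost)
    labels = {n: min(best, key=lambda m: dist[(n, m)]) for n in names}
    return best, labels

medoids, labels = k_medoids_exhaustive(names, dist, k=2)
print(medoids)
print(labels)
```

No ordination step is involved: the only input to the clustering function is the dissimilarity matrix.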
Note the density of cluster 1 - I’ll investigate that further.
We do see clusters with only 1 site and others with multiple sites.
[in-progress]
| Cluster | Total n | Site(s) | Grass(es) | Characteristics† | Exclusivity |
|---|---|---|---|---|---|
| 1 | 183 | | | spans pH range | |
| 2 | 37 | BNP, DMT, FMT, SEV | BOER (n=33), BOGR (n=4) | high pH* (6.8 - 8.3, mean 7.5) | |
| 3 | 56 | | | high pH**** (6.8 - 8.3, mean 7.8) | |
| 4 | 17 | | | lower pH**** (5.6 - 7.6, mean 6.2) | |
| 5 | 43 | | | high pH** (7.1 - 7.8, mean 7.6) | |
| 6 | 32 | | | lower pH**** (5.1 - 7.8, mean 6.2) | |
| 7 | 38 | | | | |
| 8 | 29 | mostly LAR (n=27) | | high pH**** (7.2 - 8.2, mean 8.0) | |
| 9 | 9 | SFA | SCSC (only grass present) | | Site=SFA |
| 10 | 29 | KAE | | lower pH**** (5.9 - 7.2, mean 6.3) | Site=KAE (of the 32 KAE samples, only 3 others were in different clusters) |
Clusters: PAM clusters (k = 10) based on Sørensen dissimilarity
Method: R package randomForest v4.7.1.1
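library(phyloseq)
library(randomForest)
# Transpose the OTU table so that samples are rows and taxa are predictor columns
predictors.all <- t(otu_table(Fun_wholecommunity))
# Response: PAM cluster assignment (Sørensen, k = 10) stored in the sample data
response.clus_sor_k10 <- as.factor(sample_data(Fun_wholecommunity)$clus_sor_k10)
rf.data.clus_sor_k10 <- data.frame(response.clus_sor_k10, predictors.all)
classify.clus_sor_k10 <- randomForest(response.clus_sor_k10 ~ ., data = rf.data.clus_sor_k10, ntree = 999)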
Call: randomForest(formula = response.clus_sor_k10 ~ ., data = rf.data.clus_sor_k10, ntree = 999)
Type of random forest: classification
Number of trees: 999
No. of variables tried at each split: 81
OOB estimate of error rate: 6.82%
Confusion matrix summary (rows are the true clusters; counts are out-of-bag predictions):

| Cluster | n | Correctly classified | Misclassified (split across other clusters) | Cluster error % |
|---|---|---|---|---|
| 1 | 183 | 183 | 0 | 0.0 |
| 2 | 37 | 34 | 3 | 8.1 |
| 3 | 56 | 50 | 6 (5 + 1) | 10.7 |
| 4 | 17 | 12 | 5 (4 + 1) | 29.4 |
| 5 | 53 | 50 | 3 (1 + 1 + 1) | 5.7 |
| 6 | 32 | 23 | 9 (7 + 2) | 28.1 |
| 7 | 38 | 34 | 4 (1 + 2 + 1) | 10.5 |
| 8 | 29 | 27 | 2 (1 + 1) | 6.9 |
| 9 | 10 | 10 | 0 | 0.0 |
| 10 | 29 | 28 | 1 | 3.4 |
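As a quick arithmetic check, the per-cluster error and the overall OOB error rate can be recomputed from the row counts of the confusion matrix above. This is a plain-Python sketch, not randomForest output; the counts are read off the matrix rows, and the names `oob_counts` and `class_error` are made up for illustration.

```python
# (correctly classified, misclassified) OOB counts per true cluster,
# read off the rows of the confusion matrix
oob_counts = {
    1: (183, 0), 2: (34, 3), 3: (50, 6), 4: (12, 5), 5: (50, 3),
    6: (23, 9), 7: (34, 4), 8: (27, 2), 9: (10, 0), 10: (28, 1),
}

def class_error(correct, wrong):
    """Fraction of a cluster's samples assigned to some other cluster."""
    return wrong / (correct + wrong)

# Per-cluster error in percent, rounded to one decimal place
class_errors = {c: round(100 * class_error(*cw), 1) for c, cw in oob_counts.items()}

total_samples = sum(correct + wrong for correct, wrong in oob_counts.values())
total_wrong = sum(wrong for _, wrong in oob_counts.values())
overall_oob_error = round(100 * total_wrong / total_samples, 2)

print(class_errors)
print(total_samples, total_wrong, overall_oob_error)
```

The recomputed values agree with the reported per-cluster errors and with the 6.82% OOB estimate (33 of 484 samples misclassified).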
Fungal whole community
Clusters 2 & 3 seem to be the outgroup / most distantly related
Clusters 1 & 9 are very similar, followed by cluster 10, so similarity wraps around the cluster numbering
- average: Species contribution to average between-group dissimilarity.
- cusum: Ordered cumulative contribution (to between-group dissimilarity). These are based on item average, but they sum up to total 1.
- p: Permutation p-value. Probability of getting a larger or equal average contribution in random permutation of the group factor.
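The relationship between these three quantities can be sketched in a few lines of plain Python. The contributions below are hypothetical, the taxon names are placeholders, and the "+1" correction in the p-value is one common convention that I am assuming rather than taking from the package that produced the output.

```python
from itertools import accumulate

# Hypothetical average contributions of four taxa to between-group dissimilarity
avg_contrib = {"taxon_a": 0.30, "taxon_b": 0.10, "taxon_c": 0.05, "taxon_d": 0.05}

# Order taxa from largest to smallest contribution
ordered = sorted(avg_contrib.items(), key=lambda kv: kv[1], reverse=True)

# cusum: running total of contributions, rescaled so the last value is 1
total = sum(value for _, value in ordered)
cusum = list(accumulate(value / total for _, value in ordered))

def perm_p_value(observed, permuted):
    """Share of permuted contributions >= the observed one (assumed +1 correction)."""
    at_least = sum(1 for value in permuted if value >= observed)
    return (at_least + 1) / (len(permuted) + 1)

for (taxon, value), running in zip(ordered, cusum):
    print(taxon, value, round(running, 3))
print(perm_p_value(0.30, [0.10, 0.20, 0.30, 0.40]))
```

Reading down the cusum column shows how many of the top-ranked taxa are needed to account for a given share of the dissimilarity between two clusters.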
Mean Sørensen dissimilarity between Clusters 1 & 2: 0.920 (unweighted) / 0.903 (weighted by sample size)
Mean within-cluster similarities
- Cluster 1: 0.514
- Cluster 2: 0.802